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Abstract 

A generalized single step stepwise mutation model (SMM) is developed that takes into 
account an arbitrary initial state to a certain partial difference equation. This is solved in 
both the approximate continuum limit and the more exact discrete form. A time evolution 
model is developed for Y DNA or mtDNA that takes into account the reflective boundary 
modeling minimum microsatellite length and the original difference equation. A 
comparison is made between the more widely known continuum Gaussian model and a 
discrete model, which is based on modified Bessel functions of the first kind. A 
correction is made to the SMM model for the probability that two individuals are related 
that takes into account a reflecting boundary modeling minimum microsatellite length. 
This method is generalized to take into account the general n-step model and exact 
solutions are found. A new model is proposed for the step distribution. 

Introduction 

Microsatellites are a special class of tandem repeats of strings in the genome consisting of 
one to six bases (1-6 bp). These short strings can be repeated up to about 100 times. 

These simple tandem repeats (STRs) are often used to determine relatedness between 
individuals and even between species. In each generation there is a small probability 
there will be a copy mistake (mutation) in the resulting microsatellite length. Those STRs 
with the highest mutation rates are most useful for predicting time to a common ancestor. 

There are many different approaches to solving this problem mathematically. The typical 
approach used in modem times is to use direct statistical methods. The approach we take 
is the older one used by OHTA and KIMURA (1973) and others which is to treat random 
genetic drift as a diffusion process and solve the resulting differential equations directly. 
The main disadvantage to this approach in the past has been that one must make a 
continuum approximation, which makes it less accurate in certain situations. This means 
that method is not always very useful for determining the relatedness of very closely 
related individuals. This particular issue is dealt with here by deriving discrete partial 
difference equations in the form of discrete diffusion equations and solving them instead, 
thereby avoiding the issues associated with continuum approximations. 

The discrete diffusion equation is widely used in many other contexts. For instance it has 
been applied to the area of population growth (LU and TAKEUCHI, 1993) where one 
wishes to model geographic spread in addition to growth in number. In the area of 



physics it is used to model ionic diffusion on a lattice (FATH 1998) It has also been used 
in digital filtering in the form of diffusion filtering (LINDEBERG 1990). 

A difference equation for the SMM 

Let N(m,n) be the number of individuals with a particular STR marker having exactly m 
repeats at the nth generation. Let us suppose this marker can mutate at a rate r per 
generation. The probability of N[m,n] mutating to m+1 or m-1 is r/2. If N(m,n) mutates, 
there is an equal loss of the unmutated state. This suggests a difference equation of the 
following form 

N(m,n + l) = N(m,n) + N(m + \,n) -2N(m,n) + N(m -!,«)). (1) 


This is a form of the discrete diffusion equation. If we instead consider the asymmetric 
case where the probability of mutating to m+1 (r) is different from mutating to m-1 (r’) 
we have 


N(m,n + l) = N(m,n ) + N(m + l,n) -2N( m,n) + N(m-l,n)) + 


r-r 


( N(m,n ) - N(m - l,n)) 


( 2 ) 


which is the more general case of a discrete form of the Fokker-Planck equation (REIF, 
1965) of statistical mechanics. As demonstrated here, the discrete diffusion equation 
comes about in a very natural way. 

The SMM in the continuum 

In the continuum limit this becomes 




(3) 


This is the well known diffusion/heat equation (REIF, 1965) and is easily solved using 
standard techniques. A solution localized at t=0 is 


N(x,t) = 


1 


v/2 


Jtrt 


.exp 


(x-x 0 ) 2 


2rt 


(4) 


This represents the solution where each individual is descended from a single source. A 
similar approach has been applied using the SMM model in the continuous limit when 
comparing the distance between two distinct populations (ZHIVOTOVSKY 1995) using 
the standard deviation as the independent variable. The variance for (4) is given by, 

o 1 = rt. 


(5) 



This is a well known result for estimating age (GOLDSTEIN, 1995). A more general 
solution to (3) is (COURANT and HILBERT 1962) 


N(x,t) 


*\ Jlitrt 


jf A(x',0)exp 


(x - x') 2 


2rt 


dx' 


( 6 ) 


With an initial state of 


A(x,0) = <5(x - x 0 ), 


(?) 


(4) is recovered. 

The interpretation of (6) is that given an initial distribution, N(x,0), which may or may 
not be normalized, how does that initial distribution evolve over time? 

The semi-discrete method 

An alternative to (3) is where we allow the x variable to be discrete and the time variable 
to be continuous. In this case we have 

j^N(m,t) = ^(N(m + l,t)- 2 N(m, t) + N(m- 1, t )) . (8) 

It can be shown a solution to (8) is 


N(rn,t) = exp[-rt]l m _ mo (rt), 


( 9 ) 


where I is the modified Bessel function of the first kind (ABRAMOWITZ and 
STEGUN). This is a result that has been found using statistical methods with the SMM 
model (WALSH 2001). A more general solution to (9) may be found using Fourier series 
techniques. We first multiply both sides by exp[I m co ] and sum over m. The result is 


N[co, t) = ^ (exp(ico)N(oj, t ) - 2 N{a>, t ) + exp (-/m)iV(a), t) j 


-2rsin 2 



where 


N[co,t)= Vexp (i(om)N{m,t) (11) 

m=- oo 

is the generating function for the solution to (8). To recover the solution from the 
generating function we use 



1 ^ 

N(m,t ) = — J exp {-i(jom)N{(jo,t)d(jo. 
2ji j 


Equation (10) is easily solved and has the solution 


N(co,t) = iV(ca,0)exp 


-2rsuf 


co , 
2 1 ' 


Using the results of (1 1), (12), and (13) we find 

1 ” n 

N(m,t) = — 2 N(m',0)J exp[-ico(m'-m)]exp 


-2rshf 


co , 
2* 


dco. 


(12) 


(13) 


(14) 


Using the identity sin 2 (x)=(l-cos(2x))/2 and AB ROMO WITZ and STEGUN we find 


—— f exp(-zwm)exp 
2jt ^ 


-2rshT 


co , 
~2 1 


doj = exp(-rt)l m (rt). 


(15) 


We have then 


N(m,t) = exp[-/t] 2 N(rn',0)l m _ m \rt ). (16) 

m'=-° o 

The extension and uniqueness of this type of solution into the discrete domain has been 
demonstrated by others in a different context (LINDEBERG, 1990). In reality m can 
never be negative so the initial state will only include non-zero values for m’>0. As long 
as the initial state is sufficiently localized and far from m’=0 (as is typically the case) 
there are no boundary effects. 

It should be noted that Fourier series techniques have been applied in the past (FATH 
1998) to solve the discrete reaction-diffusion equation. The main difference here is that 
we are using the difference equation to solve for a generating function. 

The discrete method 

To solve (2) directly we multiply both sides by exp[I men] and sum over m as in the 
previous example. The result is 




(17) 


By recursion we find 



(18) 


N(co,n) = 


1 „ ■ z, CO 

l-2rsin — 


N(co, 0). 


Although we could integrate this expression in terms of a finite series, since r is typically 
very small, we instead use the approximation 


l-2rsin 2 




-2 rn sin 2 



(19) 


Using the results of (1 1) and (12) we find using the same techniques as in the previous 
section, 


oo 

N(m,n) = exp[-rn] ^ N(tn',0)l m _ m ,(rn). (20) 

m'=- oo 


Reflecting boundary 

As stated earlier the microsatellite length, m, must be greater than zero in real life. Once 
m reaches some lower limit (let us pick 0 for now) it can get no smaller. What this does is 
create a special reflecting boundary for our PDEs. Let us first consider the continuous 
case. Our boundary condition is then 


N[x,t) = N(-x,t). 


( 21 ) 


The reasoning being that if we force the solution to be symmetric there can be no net 
flow across the x=0 boundary. This suggests a solution of the form, 


2 

N[x,t ) = —J a[a),t)cos[(ox)dco , 

n n 


( 22 ) 


where 


a(co,t) = f N[ x ,t)cos(oox)dx . 

0 


We first use (23) and substitute it in to (4). The result is, 




(23) 


( 24 ) 


The solution is, 



a{m,t ) = a(o>,0)exp 


/ r 2 ^ 

— co t 

2 


(25) 


so that 


N(x,t ) = —f a(«,0)expf-^« 2 tjcos(o>x)dm. 


(26) 


This can be cast into the form, 

oo 

N[x,t ) = J N(x' ,0)G(x,x' ,t)dx' (27) 

o 

To find G(x,y,t) we use N(x,0)=<5(x - v) so that a(co,0) = cos {coy) . From this we find 


2 °r l v 

G(x,y,t ) = — J exp — <x> 2 t cos(<x>y)cos((ox)dco = 

JT „ ' 0 


1 


■\l2itrt 


exp 


(*-y) 


2rt 


+ exp 


/ (x + yf 


/ 


2rt 


(28) 


We now have 




exp 


^ (x-x 1 ) 2 '' 


2rt 


+ exp 


(x + x') 


2 V 


2rt 


k/x' 


(29) 


As an example we choose an initial state N(x,0)=<5(x - c) so that 


N(x,t) = 




Jtrt 


exp 


(x-cf 


2rt 


+ exp 


/ / ^ \2\ 

(x + c) 

2rt 


(30) 


The discrete version can be solved in a similar manner. The result is, 


N(m,t) = exp(-rf) ^(m',0)(/ B . B ,(rt) + I m+m {rt)),rn > 1 

m '= 0 

00 

N(m,t ) = exp(-rt) ^ N(m' , ,m = 0 

m '= 0 


(31) 


The general principal is that the kemal must satisfy the original PDE, must be symmetric 
about m=0, and N must satisfy the initial state at n=0. It is interesting to note that for m>0 



the kemal in both the continuum and discrete case with a reflecting boundary is just twice 
the even part of the kemal of the non-reflecting boundary. With an absorptive boundary it 
would be twice the odd part. 

The main advantage (31) has over the continuum diffusive model is better accuracy. For 
large times (r t»l) they are very similar. As an example the discrete counterpart to (30) 
is 


A plot of the continuous and discrete solution is shown in Figure 1. This result matches 
favorably with experimental results, which show a positive skewed curve (CALABRESE 
et. al., 2001). Some have also treated an upper bound as a reflecting boundary (NAUTA 
and WEISSING, 1996) in order to model maximum microsatellite length. Although this 
limits the microsatellite length, it is not clear whether that particular model has any basis 
in physical reality. There is, on the other hand, a real hard limit on the lower end, which 
cannot be violated. As a result we feel a reflecting boundary on the lower limit is a valid 
boundary condition. 



(32) 
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Figure 1. A comparison of continuous and discrete solution for various large times and 
c=10 with a reflective boundary shows that the continuous and discrete solutions are very 
similar. The dots represent the discrete solution. 
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Figure 2. A comparison of the continuous and discrete solution for small times and 
c=x=10 with a reflective boundary shows that the continuous and discrete solutions 

diverge for small times. 

Comparison of two individuals with a reflecting boundary 

Up to this point we have shown how the distribution of marker length of a population 
evolves over time. If we instead wanted to compare the probability that two individuals 
are related, let us suppose two individuals have a known common ancestor in time t. The 
probability of them having a common ancestor after t generations is 


We use 2rt because each is separated by rt from the common ancestor. In contrast, 
without the reflecting boundary the result is, 



(33) 


p(x v x 2 ,t) = exp(-2rt)l Xi _ X} (2rt) 


(34) 


In practice, the reflecting boundary has very little effect except in cases where the 
microsatellite length is small or when the time between common ancestors is very long. 
Figure 3 shows a comparison between two individuals with and without a reflecting 
boundary for xi=6 and x 2 =5. 



Figure 3. Comparison of the probability two individuals are related at time t for xi=6 and 
x 2 =5 for the reflecting and non-reflecting boundary cases. 


The symmetric n-step model 

It has been shown that for some loci a single step model may no longer be sufficient 
(Whittaker, et al., 2003) and an n-step model may be more appropriate. For this model (8) 
becomes 


J. N{m,t) = ^r k (N(m + 

01 Z Jk-1 


k, t ) - 2N(m, t) + N(m-k, t)) , 


(35) 


where rk are the mutation rates for the kth step size. Applying the same transformation as 
in (9) and solving the resulting differential equation we find 


so that 


N(co,t) = N(co, 0)exp 


-2^r k single 


k - 1 


(36) 


where 


oo 

N(m,t)= ^ N(m',0)G(m-m',t ), 

m '=- oo 


G(m,t) = ^—f exp(-io>m)exp 


-2^r k sm-\^k\t 


k - 1 


\da). 


(37) 

(38) 


As in the previous sections, it is understood that m is only defined for values of m greater 
than zero in nature. Therefore, this mathematical solution is only valid as long as the 
interaction with the m=0 boundary is minimal. Should this not be the case, one may 
easily apply a reflective boundary as we have in the previous section. 

Equation (38) is an amazingly simple result and can be calculated for any combination of 
step size. To test the validity of (38) we take the simple case of the two-step model, 
where 


G(m,t) = ^—J exp(-/mm)exp 


-2 r x sin" 


CO , 
2 1 ' 


exp[-2r 2 sin 2 (a>)t dco . (39) 


By expanding (39) in a Fourier series and using the convolution theorem for Fourier 
series (ZEMANIAN, 1965), it can be shown this reduces to 


G{m,t) = exp(-r 1 t)exp(-r 2 t) ^I m {r 2 t)l m _ 2m {r x t). (40) 

m'=-o° 

This is exactly the result found by WEHRHAHN 1975 using probability generating 
functions. 

In a recent paper WATKINS 2007 found a special case of an n-step solution using matrix 
methods and taking a limit as the matrix size approaches infinity. In his solution he found 
the special case where the rate distribution drops off geometrically (inspired by the work 
of Whittaker, et al., 2003) and found an explicit expression for this. We are able to easily 
derive that result using our methods instead. In our notation this distribution is 

h = K 1 - c i) c i k ~' > ( 41 ) 

where q is a parameter such that 0 < q < 1. The sum in (38) becomes (DWIGHT 1961) 



( 42 ) 


-2^ sin2 

k= 1 


u 



(l - g)(cos(o>) - q ) 
1 - 2^cos(co) + q 2 


Equation (38) becomes 




exp(-rt) nn . . 

— ^ — - J exp(-i(om)exp 


(l-g)(cos(a))-^r) 
l-2gcos(o>) + g 2 


rt 


Jo). 


(43) 


As demonstrated by WATKINS, this approaches the single step model in the limit as 
q—>0 and the infinite alleles model as q~>l. However, this is not the only possible 
model with that behavior. As an alternative to the geometric drop off in the mutation 
distribution with step size, we propose it could very well be a Gaussian instead, with the 
center of the curve representing no mutation, then dropping off as a Gaussian for 1 or 
more steps. That makes more sense if it is a random event. It is a simple matter to 
calculate this. With 


2r 


h = 


AM “I 


(44) 


where # 3 is the elliptic Theta function of the third kind (WHITTAKER and WATSON 
1927), the sum in (38) becomes 




r £ S111 


CO 


k= 1 


— k I = 7 > q sin 

2 ) #3(0,4)-l m 


4r 




V #3 \-,q -1 

°> k )- r UV r 

2 J“ # 3 (0,9)-l ' 


(45) 


In this case (38) becomes 


G(m,t ) = 6 ^ — —J exp(-/a>m)exp 


2k 


#, 


a> 


,q -1 


# 3 (0,#)-l 


rt 


dio 


(46) 


Just as in the previous example, this model also approaches the single step model in the 
limit q—> 0 and the infinite allele model in the limit 4 — * 1 . 

Conclusion 

We have demonstrated the utility in using the discrete diffusion approach in solving the 
SMM model. We feel this approach has many advantages to straight statistical methods. 
The first is the high degree of flexibility in specifying the type of solution, initial state (if 



any), and the natural way different types of boundaries are dealt with by the extension to 
continuum PDEs. Many of the techniques applied to continuum PDEs may be applied 
here. One of the most powerful of these is using transform methods. Translated into the 
discrete domain we use Fourier series techniques rather than Fourier transforms. We have 
demonstrated one may transform a partial difference equation to a first order differential 
equation for the generating function for the solution using very simple methods. Kimura 
and others have done much work using the continuum diffusion equation. We feel that by 
extending his methods to the discrete diffusion equation and using our techniques one 
may breath new life into an older method. 
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